    Treating phenotype as given: a simple resampling method for genome-wide association studies

    Significance of genetic association to a marker has traditionally been evaluated through statistics standardized so that their null distributions conform to known ones; this standardization often requires distributional assumptions. Based on the observation that the phenotype remains the same regardless of the marker being investigated, we propose a simple statistic that needs no such standardization, together with a resampling procedure to assess its genome-wide significance. This method was applied to replicate 2 of the Genetic Analysis Workshop 17 simulated data on unrelated individuals in an attempt to map phenotype Q2. However, none of the selected SNPs lie in disease-causing genes, which may be due to the weak effect that each individual genetic factor has on Q2.
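
    A minimal sketch of the general idea described here — keeping the phenotype fixed, using an unstandardized association statistic, and calibrating genome-wide significance by resampling — is given below. The absolute-covariance statistic and the max-statistic adjustment are illustrative assumptions, not necessarily the authors' exact procedure.

```python
import numpy as np

def genomewide_permutation_pvalues(genotypes, phenotype, n_perm=1000, seed=0):
    """Permutation-based genome-wide significance for an unstandardized
    association statistic (here: |covariance| of phenotype with each marker).

    genotypes : (n_individuals, n_markers) array of allele counts
    phenotype : (n_individuals,) array of quantitative trait values
    """
    rng = np.random.default_rng(seed)
    y = phenotype - phenotype.mean()
    g = genotypes - genotypes.mean(axis=0)

    # Observed statistic per marker: absolute sample covariance.
    obs = np.abs(y @ g) / len(y)

    # Null distribution of the genome-wide maximum, built by permuting
    # the phenotype while the genotypes stay fixed ("phenotype as given").
    max_null = np.empty(n_perm)
    for b in range(n_perm):
        y_perm = rng.permutation(y)
        max_null[b] = np.abs(y_perm @ g).max() / len(y)

    # Family-wise adjusted p-value for each marker.
    adj_p = (1 + (max_null[:, None] >= obs[None, :]).sum(axis=0)) / (n_perm + 1)
    return obs, adj_p
```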

    Calibrating the Performance of SNP Arrays for Whole-Genome Association Studies

    To facilitate whole-genome association studies (WGAS), several high-density SNP genotyping arrays have been developed. Genetic coverage and statistical power are the primary benchmark metrics in evaluating the performance of SNP arrays. Ideally, such evaluations would be done on a SNP set and a cohort of individuals that are both independently sampled from the original SNPs and individuals used in developing the arrays. Without an independent test set, previous estimates of genetic coverage and statistical power may be subject to an overfitting bias. Additionally, the SNP arrays' statistical power in WGAS has not been systematically assessed on real traits. One robust setting for doing so is to evaluate statistical power on thousands of traits measured from a single set of individuals. In this study, 359 newly sampled Americans of European descent were genotyped using both Affymetrix 500K (Affx500K) and Illumina 650Y (Ilmn650K) SNP arrays. From these data, we were able to obtain estimates of genetic coverage that are robust to overfitting by constructing an independent test set from among these genotypes and individuals. Furthermore, we collected liver tissue RNA from the participants and profiled these samples on a comprehensive gene expression microarray. The RNA levels were used as a large-scale set of quantitative traits to calibrate the relative statistical power of the commercial arrays. Our genetic coverage estimates are lower than previous reports, providing evidence that previous estimates may be inflated due to overfitting. The Ilmn650K platform showed reasonable power (50% or greater) to detect SNPs associated with quantitative traits when the signal-to-noise ratio (SNR) is greater than or equal to 0.5 and the causal SNP's minor allele frequency (MAF) is greater than or equal to 20% (N = 359). In testing each of the more than 40,000 gene expression traits for association to each of the SNPs on the Ilmn650K and Affx500K arrays, we found that the Ilmn650K yielded 15% more discoveries than the Affx500K at the same false discovery rate (FDR) level.
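
    As an illustration of this kind of power calibration, the sketch below estimates, by simulation, the power to detect a causal SNP at a given MAF and SNR for a chosen significance threshold. The additive model, the Pearson correlation test, and the fixed p-value cutoff are assumptions for illustration, not the study's actual FDR-based protocol.

```python
import numpy as np
from scipy import stats

def simulated_power(maf, snr, n=359, alpha=5e-8, n_sim=2000, seed=1):
    """Monte Carlo estimate of power to detect a causal SNP in a simple
    additive model, as a function of minor allele frequency and
    signal-to-noise ratio (genetic variance / residual variance)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sim):
        g = rng.binomial(2, maf, size=n).astype(float)
        # Scale the genetic effect so that var(effect) / var(noise) == snr.
        g_std = (g - g.mean()) / (g.std() + 1e-12)
        y = np.sqrt(snr) * g_std + rng.standard_normal(n)
        r, p = stats.pearsonr(g, y)
        hits += p < alpha
    return hits / n_sim
```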

    A Bayesian method for evaluating and discovering disease loci associations

    Background: A genome-wide association study (GWAS) typically involves examining representative SNPs in individuals from some population. A GWAS data set can contain a million SNPs and may soon contain billions. Researchers investigate the association of each SNP individually with a disease, and it is becoming increasingly commonplace to also analyze multi-SNP associations. Techniques for handling so many hypotheses include the Bonferroni correction and recently developed Bayesian methods. These methods can encounter problems; most importantly, they are not applicable to a complex multi-locus hypothesis that has several competing hypotheses rather than only a null hypothesis. A method that computes the posterior probability of complex hypotheses is a pressing need. Methodology/Findings: We introduce the Bayesian network posterior probability (BNPP) method, which addresses these difficulties. The method represents the relationship between a disease and SNPs using a directed acyclic graph (DAG) model and computes the likelihood of such models using a Bayesian network scoring criterion. The posterior probability of a hypothesis is computed based on the likelihoods of all competing hypotheses. The BNPP can be used not only to evaluate a hypothesis that has previously been discovered or suspected, but also to discover new disease-locus associations. The results of experiments using simulated and real data sets are presented. Our results on simulated data sets indicate that the BNPP exhibits both better evaluation and better discovery performance than a p-value based method. For the real data sets, previous findings in the literature are confirmed and new findings are reported. Conclusions/Significance: We conclude that the BNPP resolves a pressing problem by providing a way to compute the posterior probability of complex multi-locus hypotheses. A researcher can use the BNPP to determine the expected utility of investigating a hypothesis further. Furthermore, we conclude that the BNPP is a promising method for discovering disease loci associations. © 2011 Jiang et al.
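
    The step of turning Bayesian network scores of competing hypotheses into a posterior probability can be sketched as follows. The hypothesis names and the log-score values in the example call are placeholders, and the scoring criterion itself (e.g., a Bayesian Dirichlet score over the DAG models) is not reproduced here.

```python
import numpy as np

def posterior_from_scores(log_scores, log_priors=None):
    """Turn Bayesian-network log marginal likelihoods of competing
    hypotheses (DAG models) into posterior probabilities.

    log_scores : dict mapping hypothesis name -> log P(data | model)
    log_priors : optional dict of log prior probabilities (uniform if None)
    """
    names = list(log_scores)
    ls = np.array([log_scores[h] for h in names], dtype=float)
    if log_priors is not None:
        ls = ls + np.array([log_priors[h] for h in names], dtype=float)
    ls -= ls.max()                      # stabilize before exponentiating
    post = np.exp(ls) / np.exp(ls).sum()
    return dict(zip(names, post))

# Example with placeholder scores: three competing hypotheses for a disease
# node -- no association, a single-SNP parent, and a two-SNP parent set.
print(posterior_from_scores({"none": -1052.3, "SNP1": -1047.8, "SNP1+SNP2": -1049.1}))
```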

    Assessing Significance in High-Throughput Experiments by Sequential Goodness of Fit and q-Value Estimation

    We developed a new multiple hypothesis testing adjustment called SGoF+, implemented as a sequential goodness-of-fit metatest. It is a modification of a previous algorithm, SGoF, and takes advantage of the distribution of p-values to fix the rejection region. The new method uses a discriminant rule based on the maximum distance between the uniform distribution of p-values and the observed one to set the null for a binomial test. This new approach shows a better power/pFDR ratio than SGoF; in fact, SGoF+ automatically sets the threshold leading to the maximum power and the minimum false non-discovery rate within the SGoF family of algorithms. Additionally, we suggest combining the information provided by SGoF+ with an estimate of the FDR incurred when rejecting a given set of nulls. We study different positive false discovery rate (pFDR) estimation methods for combining q-value estimates with the information provided by the SGoF+ method. Simulations suggest that combining the SGoF+ metatest with q-value information is an attractive strategy for dealing with multiple testing issues. These techniques are provided in the latest version of the SGoF+ software, freely available at http://webs.uvigo.es/acraaj/SGoF.htm.
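
    A hedged sketch of the general mechanism — locating the p-value threshold at the maximum distance between the observed and uniform CDFs, testing the excess of small p-values with a binomial test, and declaring that excess significant — is given below. It is an approximation of the published SGoF+ algorithm, not a reimplementation.

```python
import numpy as np
from scipy import stats

def sgof_plus_sketch(pvalues, alpha=0.05):
    """Approximate SGoF+ idea: pick the p-value threshold where the empirical
    CDF departs most from the uniform CDF, test the excess of small p-values
    with a one-sided binomial test, and declare that excess significant."""
    p = np.sort(np.asarray(pvalues))
    m = len(p)
    ecdf = np.arange(1, m + 1) / m
    # Threshold with maximum distance between observed and uniform CDFs.
    i_star = int(np.argmax(ecdf - p))
    gamma = p[i_star]
    observed = i_star + 1                     # p-values at or below gamma
    # Are there more small p-values than expected under the complete null?
    meta_p = stats.binomtest(observed, m, gamma, alternative="greater").pvalue
    excess = max(0, observed - int(np.ceil(m * gamma)))
    n_reject = excess if meta_p < alpha else 0
    return n_reject, gamma, meta_p
```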

    More Powerful and Reliable Second-level Statistical Randomness Tests for NIST SP 800-22

    Random number generators (RNGs) are essential for cryptographic systems, and statistical tests are usually employed to assess the randomness of their outputs. As the most commonly used statistical test suite, the NIST SP 800-22 suite includes 15 test items, each of which contains two-level tests. For the test items based on the binomial distribution, we find that their second-level tests are flawed due to the inconsistency between the assessed distribution and the assumed one. That is, a sequence that passes the test could still have statistical flaws in the assessed aspect. For this reason, we propose the Q-value as the metric for these second-level tests to replace the original P-value, without any other modification; the first-level tests are kept unchanged. We provide a correctness proof of the proposed Q-value based second-level tests. We perform a theoretical analysis to demonstrate that the modification improves not only the detectability but also the reliability: a sequence that does not satisfy the randomness hypothesis is more likely to be rejected by the improved test, and a sequence that satisfies the hypothesis is more likely to pass it. The experimental results on several deterministic RNGs indicate that the Q-value based method is able to detect some statistical flaws that the original SP 800-22 suite cannot detect under the same test parameters.
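
    For context, the second-level uniformity check in SP 800-22 bins the first-level results into ten sub-intervals of [0, 1) and applies a chi-square goodness-of-fit test; a sketch is shown below. Under the paper's proposal, the binomial-based test items would feed Q-values rather than two-sided P-values into this otherwise unchanged step; the Q-value computation itself is not reproduced here.

```python
import numpy as np
from scipy.special import gammaincc

def second_level_uniformity(values, bins=10):
    """SP 800-22 style second-level uniformity check: bin the first-level
    results into equal sub-intervals of [0, 1) and apply a chi-square
    goodness-of-fit test against the uniform distribution."""
    v = np.asarray(values, dtype=float)
    n = len(v)
    counts, _ = np.histogram(v, bins=bins, range=(0.0, 1.0))
    expected = n / bins
    chi2 = ((counts - expected) ** 2 / expected).sum()
    # P-value of the uniformity test: igamc((bins - 1) / 2, chi2 / 2).
    return gammaincc((bins - 1) / 2.0, chi2 / 2.0)
```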

    Towards Accurate Estimation of the Proportion of True Null Hypotheses in Multiple Testing

    BACKGROUND: Biomedical researchers are now often faced with situations where it is necessary to test a large number of hypotheses simultaneously, e.g., in comparative gene expression studies using high-throughput microarray technology. To properly control false positive errors, the FDR (false discovery rate) approach has become widely used in multiple testing. Accurate estimation of the FDR requires that the proportion of true null hypotheses be accurately estimated. To date, many methods for estimating this quantity have been proposed. Typically, when a new method is introduced, some simulations are carried out to show its improved accuracy. However, these simulations often cover only a few points in the parameter space. RESULTS: Here I have carried out extensive in silico experiments to compare some commonly used methods for estimating the proportion of true null hypotheses. The coverage of these simulations is unprecedentedly thorough over the parameter space compared to typical simulation studies in the literature, which makes it possible to draw global conclusions about the performance of the different methods. It was found that a very simple method gives the most accurate estimation over a dominant portion of the parameter space. Given its simplicity and its overall superior accuracy, I recommend it as the first choice for estimating the proportion of true null hypotheses in multiple testing.
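
    The abstract does not identify which simple method it recommends. Purely as an illustration of how simple such an estimator can be, below is Storey's widely used λ-based estimator of the proportion of true null hypotheses.

```python
import numpy as np

def pi0_storey(pvalues, lam=0.5):
    """Storey's simple estimator of the proportion of true null hypotheses:
    under the null, p-values are uniform, so the region above lambda is
    dominated by true nulls, giving pi0 ~= #{p > lambda} / ((1 - lambda) * m)."""
    p = np.asarray(pvalues)
    m = len(p)
    return min(1.0, (p > lam).sum() / ((1.0 - lam) * m))
```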

    Effects of dependence in high-dimensional multiple testing problems

    Background: We consider the effects of dependence among variables of high-dimensional data in multiple hypothesis testing problems, in particular on False Discovery Rate (FDR) control procedures. Recent simulation studies consider only simple correlation structures among variables, which bear little resemblance to real data features. Our aim is to systematically study the effects of several network features, such as sparsity and correlation strength, by imposing dependence structures among variables using random correlation matrices. Results: We study the robustness against dependence of several FDR procedures that are popular in microarray studies, such as the Benjamini-Hochberg FDR, Storey's q-value, SAM, and resampling-based FDR procedures. False non-discovery rates and estimates of the number of null hypotheses are computed from these methods and compared. Our simulation study shows that methods such as SAM and the q-value do not adequately control the FDR to the level claimed under dependence conditions. On the other hand, the adaptive Benjamini-Hochberg procedure seems to be the most robust while remaining conservative. Finally, the estimates of the number of true null hypotheses under various dependence conditions are variable. Conclusion: We discuss a new method for efficient guided simulation of dependent data that satisfy imposed network constraints as conditional independence structures. Our simulation set-up allows for a structural study of the effect of dependencies on multiple testing criteria and is useful for testing a potential new method of π0 or FDR estimation in a dependency context.
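
    The kind of experiment described here can be sketched as follows: generate correlated test statistics, apply the Benjamini-Hochberg procedure at a nominal level, and record the false discovery proportion actually achieved. The equicorrelated covariance block is a deliberately simple stand-in for the paper's random correlation matrices, and the one-sample t-test setting is an assumption for illustration.

```python
import numpy as np
from scipy import stats

def bh_reject(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns a boolean reject mask."""
    p = np.asarray(pvalues)
    m = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

def realized_fdr_under_dependence(m=500, n=20, rho=0.6, frac_alt=0.1,
                                  effect=1.0, n_rep=200, q=0.05, seed=2):
    """Monte Carlo check of how an equicorrelated dependence structure affects
    the FDR actually achieved by Benjamini-Hochberg at nominal level q."""
    rng = np.random.default_rng(seed)
    m_alt = int(frac_alt * m)
    cov = np.full((m, m), rho) + (1 - rho) * np.eye(m)
    chol = np.linalg.cholesky(cov)
    fdps = []
    for _ in range(n_rep):
        x = rng.standard_normal((n, m)) @ chol.T      # correlated variables
        x[:, :m_alt] += effect                        # shift the alternatives
        p = stats.ttest_1samp(x, 0.0).pvalue
        rej = bh_reject(p, q)
        false = rej[m_alt:].sum()
        fdps.append(false / max(1, rej.sum()))
    return float(np.mean(fdps))
```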

    Re-sampling strategy to improve the estimation of number of null hypotheses in FDR control under strong correlation structures

    Background: When conducting multiple hypothesis tests, it is important to control the number of false positives, or the False Discovery Rate (FDR). However, there is a trade-off between controlling the FDR and maximizing power. Several methods, such as the q-value method, have been proposed to estimate the proportion of true null hypotheses among the tested hypotheses and to use this estimate in the control of the FDR. These methods usually depend on the assumption that the test statistics are independent (or only weakly correlated). However, many types of data, for example microarray data, often contain large-scale correlation structures. Our objective was to develop methods that control the FDR while maintaining a greater level of power in highly correlated datasets by improving the estimation of the proportion of null hypotheses. Results: We showed that when strong correlation exists among the data, which is common in microarray datasets, the estimate of the proportion of null hypotheses can be highly variable, resulting in a high level of variation in the FDR. We therefore developed a re-sampling strategy that reduces this variation by breaking the correlations between gene expression values and then conservatively selects the upper quartile of the re-sampled estimates to obtain strong control of the FDR. Conclusion: In simulation studies and perturbations of actual microarray datasets, our method, compared to competing methods such as the q-value, generated slightly biased estimates of the proportion of null hypotheses but with lower mean square errors. When selecting genes while controlling the same FDR level, our method has, on average, a significantly lower false discovery rate in exchange for a minor reduction in power.
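
    A minimal sketch of the conservative strategy, as read from the abstract: estimate the proportion of true nulls on many resampled datasets and keep the upper quartile of those estimates. The resampling scheme that breaks the gene-gene correlations is left to a user-supplied callable, and a simple λ-based estimator stands in for whichever π0 estimator the paper actually uses.

```python
import numpy as np

def conservative_pi0(resample_pvalues, n_resamples=100, lam=0.5, seed=3):
    """Estimate the proportion of true nulls on many resampled datasets and
    return the upper quartile of the estimates as a conservative value.

    resample_pvalues : callable(rng) -> array of p-values computed on one
        resampled dataset; the specific resampling scheme that breaks the
        inter-gene correlation is supplied by the caller."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_resamples):
        p = np.asarray(resample_pvalues(rng))
        estimates.append(min(1.0, (p > lam).mean() / (1.0 - lam)))
    return float(np.percentile(estimates, 75))
```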

    A constrained polynomial regression procedure for estimating the local False Discovery Rate

    Background: In the context of genomic association studies, in which a large number of statistical tests are performed simultaneously, the local False Discovery Rate (lFDR), which quantifies the evidence of a specific gene's association with a clinical or biological variable of interest, is a relevant criterion for taking the multiple testing problem into account. The lFDR not only allows an inference to be made for each gene through its specific value, but also provides an estimate of Benjamini-Hochberg's False Discovery Rate (FDR) for subsets of genes. Results: Within the framework of estimation procedures that make no distributional assumption under the alternative hypothesis, a new and efficient procedure for estimating the lFDR is described. The results of a simulation study indicate good performance of the proposed estimator in comparison with four published ones. The five procedures were applied to real datasets. Conclusion: A novel and efficient procedure for estimating the lFDR was developed and evaluated.
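
    For reference, the standard definitions connecting the local FDR to the FDR of a gene subset (not spelled out in the abstract, but implied by it) are:

```latex
% f is the marginal density of the p-values, f_0 = 1 the null (uniform)
% density, and \pi_0 the proportion of true null hypotheses.
\mathrm{lFDR}(p_i) \;=\; \frac{\pi_0 \, f_0(p_i)}{f(p_i)} \;=\; \frac{\pi_0}{f(p_i)},
\qquad
\widehat{\mathrm{FDR}}(S) \;=\; \frac{1}{|S|} \sum_{i \in S} \mathrm{lFDR}(p_i).
```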